Introduction

The fishHook R package enables agile statistical analysis of coding and non-coding mutational recurrence in cancer through generalized linear modeling (GLM) of somatic mutation densities and their heterogeneity along the genome. fishHook can be applied to the analysis of any collection of genomic intervals (e.g. genes, enhancers, promoters, genomic tiles) or complex sets of intervals (e.g. genes sets representing pathways, enhancers sets known to interact with a target gene). The FishHook package is integrated with GenomicRanges and data.table packages, allowing incroporation into bioinformatics / genomic data science workflows in the R/Bioconductor ecosystem.

The core of the fishHook framework is to enable nomination of loci following the correction of known covariates of neutral mutation, e.g. chromatin state, replication timing, and nucleotide context. The goal of fishHook is to identify cancer drivers, i.e. loci that are under positive somatic selection and accumulate mutations above “background”. This analysis hinges on the application of a correct null / background model that yields straight Q-Q plots.

Though we provide many pre-computed covariates, the key power of fishHook is to empower the user to add custom covariates and (more broadly) provide a framework for the generation and fitting of custom models to nominate loci, e.g. modeling variant- and tumor-type specific background mutational processes.

Please include this citation if you use fishHook: Imielinski, Guo, Meyerson Cell. 2017 Jan 26;168(3):460-472

Driver discovery in cancer whole exomes

We will demonstrate a quick whole exome analysis using public TCGA lung adenocarcinoma mutation data. Additional packages like gTrack and rtracklayer will help with data import and visualization, but are not necessary to run fishHook.

library(fishHook)    
library(gTrack)
library(rtracklayer)

Read in data

Read in the mutation data and additional tracks that we will use in this analysis.

## mutation calls cached from public GDAC Broad firehose https://bit.ly/2sFxWY6
mutations = dt2gr(fread('http://mskilab.com/fishHook/hg19/luad.maf'))

## GENCODE v19 genes these are our "hypotheses"
genes = gr.sub(import('http://mskilab.com/fishHook/hg19/gencode.v19.genes.gtf')) # rtracklayer::import reads gtf and gr.sub replaces chr
genes = genes %Q% (gene_type == 'protein_coding') %Q% (level<3)  # %Q% is a gUtils subsetting operator for GRanges

## protein coding CDS definitions
cds = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/gencode.v19.cds.rds')))

## bigWig file of fractional coverage of hg19 positions by Agilent exome
## we will use this in combination with cds to define eligible positions
exomecov = import('http://mskilab.com/fishHook/hg19/exome_coverage.bw')

Take a peek at our mutations

head(mutations[, c('Tumor_Sample_Barcode', 'Variant_Type', 'Variant_Classification', 'Reference_Allele', 'Tumor_Seq_Allele2')])
## GRanges object with 6 ranges and 5 metadata columns:
##       seqnames    ranges strand |         Tumor_Sample_Barcode
##          <Rle> <IRanges>  <Rle> |                  <character>
##   [1]        1    905907      * | TCGA-05-4249-01A-01D-1105-08
##   [2]        1   1192480      * | TCGA-05-4249-01A-01D-1105-08
##   [3]        1   1854885      * | TCGA-05-4249-01A-01D-1105-08
##   [4]        1   9713992      * | TCGA-05-4249-01A-01D-1105-08
##   [5]        1  12908093      * | TCGA-05-4249-01A-01D-1105-08
##   [6]        1  17257855      * | TCGA-05-4249-01A-01D-1105-08
##       Variant_Type Variant_Classification Reference_Allele
##        <character>            <character>      <character>
##   [1]          SNP      Missense_Mutation                A
##   [2]          SNP                 Silent                C
##   [3]          SNP                    IGR                G
##   [4]          SNP                 Intron                G
##   [5]          SNP      Missense_Mutation                C
##   [6]          SNP      Missense_Mutation                C
##       Tumor_Seq_Allele2
##             <character>
##   [1]                 T
##   [2]                 A
##   [3]                 C
##   [4]                 A
##   [5]                 A
##   [6]                 T
##   -------
##   seqinfo: 25 sequences from an unspecified genome

Instantiate FishHook object from events and eligible territory

First we define the “eligible territory”. This is a key component of all somatic mutational recurrence analyses, since much of the genome is not covered in sequencing studies. For example, in a whole exome sequencing dataset, less than 2% of the genome is reliably captured. In a targeted sequencing panel, this fraction will be even smaller. Even in whole genome sequencing using Illumina short reads, only a subset (70%) of the genome is reliably callable (Heng Li Bioinformatics 2014 Oct 15;30(20):2843–8751)

Eligible territory coverage will influence the “denominator” of our recurrence analysis, i.e. the number of positions in each hypothesis interval where a mutation could have possibly been detected. If we do not take eligible territory into account we will mis-estimate the background mutation rate in a given region.

To define eligible territory for this whole exome analysis, we will choose the portion of cds (protein coding) bases that are captured in at least 95% of whole exomes, which represents about 26MB of genome.

eligible = exomecov %Q% which(score>0.95) # %Q% is a gUtils operator for subsetting on GRanges metadata
eligible = reduce(intersect(eligible, cds, ignore.strand = TRUE)) # we intersect and reduce / collapse our prelim eligible intervals with CDS boundaries

We define “events” as nonsynonymous mutations. In this simple model, we will lump together SNVs (of different flavors), and indels (of different flavors). (We discuss more complex models that subdivide mutation types later in the tutorial).

events = mutations %Q% (Variant_Classification != 'Silent') 

Now that we have loaded our hypotheses (i.e. genes), events, and eligible, we are ready to create and analyze a basic FishHook object. Under the hood, object creation triggers counting of how many events are in the eligible portion of each hypothesis interval. We provide the idcol parameter so that each tumor sample (as defined by the Tumor_Sample_Barcode column in the events GRanges) will provide at most one event to the counts of each interval.

fish = Fish(hypotheses = genes[, 'gene_name'], events = events, eligible = eligible, idcol = 'Tumor_Sample_Barcode')
## Contains 19796 hypotheses.
## Contains 56809 events to map to hypotheses.
## Spanning 24.58 MB of eligible territory.
## Covariates:
## 
## Hypotheses contains 1 metadata columns.
## Current State: Annotated

Run basic model without covariates

We can score this basic FishHook object, i.e. compute p values for every hypothesis, using a simple glm that models a uniform mutation density along the genome, i.e. the glm fits only an intercept and applies no covariates (after correcting for the number of eligible bases in each interval).

The $res field of the FishHook contains a data.table of scoring results. $res has one row per input hypothesis, with p values, FDRs, and additional interval annotations provided with the hypotheses GRanges.

fish$score()
head(fish$res[order(p), ])
fish$qqp()
seqnames start end width strand gene_name p fdr count effectsize count.pred count.density count.pred.density p.neg fdr.neg
17 7565097 7590856 25760
TP53 0.0e+00 0.0000 106 5.62 2.15 0.0941 0.00191 1 1
2 79384132 79386879 2748
REG3A 1.0e-07 0.0009 17 4.09 1.00 0.0324 0.00191 1 1
19 10596796 10614417 17622
KEAP1 5.0e-07 0.0031 30 3.54 2.57 0.0222 0.00191 1 1
5 24487209 24645087 157879
CDH10 3.6e-06 0.0180 41 3.19 4.50 0.0174 0.00191 1 1
3 147103833 147124647 20815
ZIC4 5.5e-06 0.0200 18 3.41 1.69 0.0203 0.00191 1 1
8 88882973 88886296 3324
DCAF4L2 6.4e-06 0.0200 22 3.28 2.26 0.0186 0.00191 1 1


You will notice that this Q-Q plot appears curved and inflated, though its slope (lambda) is reasonably near 1. The low alpha value (MLE of the alpha parameter of the Gamma distribution), suggests that the GLM is detecting additional variance in the data that is unmodeled by a pure Poisson regression. Adding covariates to the model should improve the quality of the fit.

The top hits in the plot (you can hover over them) include TP53 but also many unlikely cancer gene candidates. Among these are olfactory receptors, which are located in late replicating regions of the human genome and thus accumulate neutral mutations more frequently (Lawrence et al 2013 Nature Jul 11;499(7457):214-218).

Add covariates to FishHook model

To address these issues, we will load in data specifying replication timing, chromatin state, and nucleotide context. These are all important determinants of somatic neutral mutation density. We load these data in as GRanges objects using functions in data.table, rtracklayer, and gUtils packages (however you are free to use any GRanges import utility of your choice).

We first load in replication timing data as a GRanges then instantiate it as a Covariate Replication timing information is contained in the $score metadata field of reptime. We instantiate it as a covariate of type “numeric” by specifying field score.

## replication timing for NHEK obtained from  https://bit.ly/2sRsXT9 and
## converted to rds via rtracklayer::import
reptimedata = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/RT_NHEK_Keratinocytes_Int92817591_hg19.rds')))

## instantiate covariate around 'score' field, name the Covariate "replication timing"
reptime = Cov(data = reptimedata, field = 'score', name = 'ReplicationTiming') 

Below, context is a GRanges object with 98 columns representing tri, di, and mononucleotide context counts in the hg19 genome. Code for computing context (e.g. for another genome) is provided here.

We instantiate a numeric Covariate object from context, choosing only two of the columns here to take into account G and C content. Note that the covariate object can be vectorized (concatenated, subsetted) and instantiated around several columns of an input GRanges. As a result, contextcov will be length 2 (representing G and C nucleotides fraction).

context = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/nucleotide.context.rds')))
gc = Cov(data = context, field = c('C', 'G')) ## instantiate Covariate around G and C fields

Finally we load in chromHMM data for cell line A549 from Epigenomics Roadmap. We will want to create a covariate that will model the fraction of heterochromatic and quiescent regions in each query interval.

To do so, we will create an “interval” covariate by not specifying a metadata field.

### data cached from https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E114_15_coreMarks_mnemonics.bed.gz
chromhmm = gr.sub(import('http://mskilab.com/fishHook/hg19/E114_15_coreMarks_mnemonics.bed.gz'), 'chr', '') ## import from bed then gUtils::gr.sub to strip 'chr' identifier
hetchromdata = chromhmm %Q% (name %in% c('8_ZNF/Rpts', '9_Het', '15_Quies')) # %Q% is gUtils operator for subsetting on GRanges metadata, in this case selecting for heterochromatic regions
hetchrom = Cov(hetchromdata, name = 'Heterochromatin') ## instantiate interval covariate

We now add these covariates to the model. For type numeric covariates, e.g. replication timing, this will trigger the calculation of the average value of each covariate within the eligible subset of each hypothesis interval. and the fractional overlap of of its eligible subset This annotation is the most computationally intensive and slowest aspect of fishHook analyses, though occurs within a few seconds for this small number of covariates.

## note how we can concatenate Covariate objects with c() operator
fish$covariates = c(reptime, gc, hetchrom)
## FishHook 2018-06-12 17:28:24: Aggregating covariates over eligible subset of hypotheses
## FishHook 2018-06-12 17:28:24: Overlapping with covered intervals
## FishHook 2018-06-12 17:28:24: Finished overlapping with covered intervals
## FishHook 2018-06-12 17:28:24: Annotating track ReplicationTiming
## FishHook 2018-06-12 17:28:27: Annotating track C
## FishHook 2018-06-12 17:28:30: Annotating track G
## FishHook 2018-06-12 17:28:32: Annotating track Heterochromatin
## Warning in self$clear(): Resetting scores since covariates re-defined,
## please run $score() to get updated p values

Now that we’ve added covariates, we can rescore fish and compute p values. Looking at these results, we immediately see an improvement in lambda (closer to one) and alpha (increased), and the nominated gene list (no more olfactory receptors, We see also a reasonable number of significant (fdr<0.1) genes at the top of the list (or at the top right of the QQ plot) that have been biologically implicated in lung adenocarcinoma tumorigenesis.

fish$score()
fish$res[fdr<0.25, ][order(p), ]
fish$qqp()
seqnames start end width strand gene_name p fdr count effectsize count.pred count.density count.pred.density p.neg fdr.neg
17 7565097 7590856 25760
TP53 0.0e+00 0.00000 106 6.27 1.370 0.09410 0.001210 1 1
19 10596796 10614417 17622
KEAP1 0.0e+00 0.00000 30 4.02 1.840 0.02220 0.001370 1 1
7 55086714 55324313 237600
EGFR 1.0e-07 0.00054 44 3.06 5.280 0.01200 0.001440 1 1
7 142457319 142460923 3605
PRSS1 6.9e-06 0.03400 16 3.06 1.910 0.02160 0.002580 1 1
1 157800704 157868046 67343
CD5L 8.7e-06 0.03400 14 3.07 1.670 0.01340 0.001610 1 1
21 44513066 44527697 14632
U2AF1 1.8e-05 0.05400 8 3.38 0.768 0.01180 0.001130 1 1
14 19553365 19590078 36714
POTEG 2.0e-05 0.05400 13 2.99 1.640 0.02210 0.002790 1 1
11 62443970 62446567 2598
UBXN1 2.2e-05 0.05400 10 3.28 1.030 0.01140 0.001180 1 1
1 175284330 175712906 428577
TNR 2.8e-05 0.05700 38 2.49 6.740 0.00933 0.001660 1 1
6 136578001 136610989 32989
BCLAF1 2.9e-05 0.05700 23 2.66 3.650 0.00834 0.001320 1 1
7 140419127 140624564 205438
BRAF 3.6e-05 0.06500 21 2.66 3.320 0.00984 0.001560 1 1
8 10463859 10569697 105839
RP1L1 4.6e-05 0.07600 48 2.34 9.450 0.00791 0.001560 1 1
19 10416103 10426691 10589
FDX1L 5.3e-05 0.08100 4 4.02 0.247 0.01650 0.001020 1 1
19 10416103 10426685 10583
CTD-2369P2.10 9.1e-05 0.13000 4 4.02 0.247 0.01650 0.001020 1 1
17 46605888 46608359 2472
HOXB1 1.2e-04 0.16000 8 2.93 1.050 0.00896 0.001180 1 1
1 175036994 175117202 80209
TNN 1.4e-04 0.17000 26 2.35 5.110 0.00742 0.001460 1 1
2 113730780 113743242 12463
IL36G 1.9e-04 0.22000 5 3.34 0.495 0.00986 0.000977 1 1
19 43511808 43530664 18857
PSG11 2.3e-04 0.23000 11 2.65 1.750 0.01090 0.001740 1 1
2 131486643 131487909 1267
GPR148 2.4e-04 0.23000 10 2.82 1.410 0.00992 0.001400 1 1
7 100663353 100702020 38668
MUC17 2.4e-04 0.23000 70 2.10 16.300 0.00519 0.001210 1 1
19 43256838 43359843 103006
PSG8 2.4e-04 0.23000 12 2.64 1.920 0.00929 0.001490 1 1


Note that this model is still quite rudimentary (e.g. we have lumped together SNVs and indels, we have not substratified SNVs by mutational context, we have employed very few covariates) but we still get a reasonable gene list and minimal statistical inflation.

Merge additional covariates

We may notice that some of the significantly mutated genes are not known to be expressed in lung adenocarcinoma tissues. We also know that genes that are more highly expressed in a given tissue have lower mutation rates.
To account for this we can use gene expression as both a filter and a covariate in the model.

## load GRanges of lung adenocarcinoma average gene experssion
exprdata = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/lung.expr.rds'))) 

# log transform expression values
exprdata$log.tpm = log10(exprdata$tpm+0.01) 

## create Covariate for log gene expression, using log.tpm field in the exprdata object
expr = Cov(exprdata, field = 'log.tpm', name = 'LungExpression') 

## merge / append this new covariate into the FishHook object
fish$merge(expr) 
## FishHook 2018-06-12 17:28:38: Appending new Covariates to this fishHook object
## FishHook 2018-06-12 17:28:38: Overlapping with covered intervals
## FishHook 2018-06-12 17:28:38: Finished overlapping with covered intervals
## FishHook 2018-06-12 17:28:38: Annotating track LungExpression
## Warning in self$clear(): Resetting scores since covariates re-defined,
## please run $score() to get updated p values

Subset the FishHook object

Now that we’ve merged in this model we can use the subsetting feature of fishHook model. The dimension of the fish object is hypotheses by covariates. You can use the [ operator to subset on either of these dimensions. We will use this functionality to subset the object to contain only genes that are expressed >10 TPM in lung, and then re-score this subsetted model. To subset, we use the fish$data object which is a data.table of all all of the covariate data, with one row per hypothesis.

fish = fish[which(fish$data$LungExpression>1), ] ## subset for high lung expression
## Contains 8765 hypotheses.
## Contains 56809 events to map to hypotheses.
## Spanning 24.58 MB of eligible territory.
## Covariates:
## ReplicationTiming
## C
## G
## Heterochromatin
## LungExpression
## Hypotheses contains 1 metadata columns.
## Current State: Annotated

This object now contains 8765 hypotheses (i.e. genes), which we can rescore to obtain p values and FDRs.

fish$score()
fish$res[fdr<0.25, ][order(p), ]
fish$qqp()
seqnames start end width strand gene_name p fdr count effectsize count.pred count.density count.pred.density p.neg fdr.neg
17 7565097 7590856 25760
TP53 0.0e+00 0.0e+00 106 6.30 1.340 0.09410 0.00119 1 1
19 10596796 10614417 17622
KEAP1 0.0e+00 0.0e+00 30 4.17 1.660 0.02220 0.00123 1 1
7 55086714 55324313 237600
EGFR 0.0e+00 1.5e-06 44 3.24 4.660 0.01200 0.00127 1 1
7 140419127 140624564 205438
BRAF 1.1e-06 2.4e-03 21 2.92 2.770 0.00984 0.00130 1 1
6 136578001 136610989 32989
BCLAF1 4.3e-06 7.2e-03 23 2.77 3.370 0.00834 0.00122 1 1
11 62443970 62446567 2598
UBXN1 4.9e-06 7.2e-03 10 3.29 1.020 0.01140 0.00117 1 1
21 44513066 44527697 14632
U2AF1 7.2e-06 9.0e-03 8 3.46 0.727 0.01180 0.00107 1 1
7 41724712 41742706 17995
INHBA 1.0e-04 1.1e-01 13 2.55 2.220 0.01020 0.00174 1 1
6 291630 351355 59726
DUSP22 1.2e-04 1.2e-01 10 2.86 1.380 0.01750 0.00242 1 1
X 47004268 47046212 41945
RBM10 1.4e-04 1.2e-01 7 3.01 0.867 0.01330 0.00165 1 1
4 177604689 177713881 109193
VEGFC 1.9e-04 1.4e-01 10 2.70 1.540 0.00928 0.00143 1 1
21 27838528 27945603 107076
CYYR1 1.9e-04 1.4e-01 6 3.23 0.639 0.01250 0.00133 1 1
1 155829300 155854990 25691
SYT11 2.4e-04 1.6e-01 10 2.53 1.730 0.00773 0.00133 1 1
5 17065707 17276943 211237
BASP1 2.9e-04 1.8e-01 4 3.83 0.281 0.01810 0.00127 1 1


To inspect the parameters of this model and see which features it is using, we can employ the$modelaccessor:

summary(fish$model)
## 
## Call:
## glm.nb(formula = formula, data = as.data.frame(tdt), maxit = iter, 
##     init.theta = 4.365024999, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7879  -1.0043  -0.2859   0.4341  16.5066  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -6.82434    0.07713 -88.476  < 2e-16 ***
## ReplicationTiming -0.29045    0.01875 -15.490  < 2e-16 ***
## C                  0.07257    0.58168   0.125 0.900716    
## G                  0.91931    0.58858   1.562 0.118308    
## Heterochromatin    0.36845    0.03111  11.843  < 2e-16 ***
## LungExpression     0.09280    0.02698   3.440 0.000582 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.365) family taken to be 1)
## 
##     Null deviance: 9605.2  on 8764  degrees of freedom
## Residual deviance: 9078.6  on 8759  degrees of freedom
## AIC: 27720
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.365 
##           Std. Err.:  0.217 
## 
##  2 x log-likelihood:  -27706.192

We can see from the Estimate and Pr(|>z|) columns of the Coefficients table that replication timing and lung gene expression are significantly negatively correlated and heterochromatin is significantly positively correlated with mutational density (as expected). However, this table shows that G and C content are not significantly correlated with mutation density. We can use the column subsetting function to remove these covariates and rescore to see how the results change.

fish2 = fish[, -c(2:3)] ## remove the 2nd and 3rd covariates ('G', 'C')
fish2$score() ## rescore
## FishHook 2018-06-12 17:28:44: Setting up problem
## FishHook 2018-06-12 17:28:44: Fitting model with 8,765 data points and 3 covariates
## FishHook 2018-06-12 17:28:45: Computing p values for 8765 hypotheses.

## check if our new top genes are identical to the previous - they are
identical(fish2$res[fdr<0.25, sort(gene_name)], fish$res[fdr<0.25, sort(gene_name)])
## [1] TRUE

The results are identical with and without G and C covariates. This urther suggests that these covariates are not necessary in the model and can likely be excluded.


Analyze Reactome pathways

Read in and parse reactome pathways into list of gene symbols, then match against genes in our model.

## parse Reactome pathways from .gmt format
pathways = strsplit(readLines('http://mskilab.com/fishHook/hg19/ReactomePathways.gmt'), '\t')
pathways = structure(lapply(pathways, '[', -1), names = sapply(pathways, '[', 1))

## match them to create sets of indices as a named list
sets = sapply(sapply(pathways, match, fish$hypotheses$gene_name), setdiff, NA)

Here is what the pathways and sets look like:

head(pathways[1:2])
## $`2-LTR circle formation`
##  [1] "R-HSA-164843" "BANF1"        "HMGA1"        "LIG4"        
##  [5] "PSIP1"        "XRCC4"        "XRCC5"        "XRCC6"       
##  [9] "gag"          "gag-pol"      "rev"          "vif"         
## [13] "vpr"          "vpu"         
## 
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] "R-HSA-73843" "PRPS1"       "PRPS1L1"     "PRPS2"

head(sets[1:2])
## $`2-LTR circle formation`
## [1] 5005 2973 5852 4018 2573 1523 8471
## 
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] 8672 8535


The list sets contains integer vectors that index fish$hypotheses.

To run a set analysis, we just set the $sets variable in the FishHook object. This triggers scoring of hypothesis sets (in this case gene sets). The results of the set analysis are shown in $setres variable.

The set analysis is a bit more computationally intensive. We can speed things up through parallelization (setting `fish$mc.cores = 5).

fish$mc.cores = 5

## this triggers scoring of gene sets using covariate corrected model
fish$sets = sets

## list top sets
fish$setres[order(p)[1:5], ]
setname p fdr effect estimate p.left p.twosided
TP53 Regulates Transcription of Genes Involved in G1 Cell Cycle Arrest 8.91e-36 0 10.8 [7.39-15.6] 10.80 1 0
TP53 regulates transcription of several additional cell death genes whose specific roles in p53-dependent apoptosis remain uncertain 1.08e-35 0 9.29 [6.54-13.2] 9.29 1 0
Regulation of TP53 Activity through Association with Co-factors 1.52e-35 0 10.7 [7.35-15.5] 10.70 1 0
Formation of Senescence-Associated Heterochromatin Foci (SAHF) 6.98e-35 0 8.67 [6.14-12.2] 8.67 1 0
PI5P Regulates TP53 Acetylation 1.27e-34 0 11.4 [7.73-16.9] 11.40 1 0


Examining the results table, we can see that most of the significant pathways appear related to TP53. Indeed, if we inspect the hypotheses (i.e. genes) contributing to these these gene sets, we will see that they are dominated by TP53 and 1-2 additional genes. For example:

## pick a set 
mysetname = fish$setres[order(p)[1], setname]

## use sets list to index fish results, sorting by p value
fish$res[fish$sets[[mysetname]], ][order(p), ]
seqnames start end width strand gene_name p fdr count effectsize count.pred count.density count.pred.density p.neg fdr.neg
17 7565097 7590856 25760
TP53 2.4e-62 0.00 106 6.30 1.340 0.09410 0.00119 1.00 1
19 30302805 30315215 12411
CCNE1 0.019 0.97 5 1.71 1.530 0.00478 0.00147 0.98 1
12 12867992 12875305 7314
CDKN1B 0.1 0.97 2 1.92 0.529 0.00403 0.00107 0.98 1
4 122737599 122745087 7489
CCNA2 0.21 0.97 2 0.55 1.370 0.00184 0.00126 0.82 1
3 51991470 52002426 10957
PCBP4 0.4 0.97 2 0.30 1.630 0.00182 0.00148 0.76 1
20 32263489 32274210 10722
E2F1 0.56 0.97 0 -Inf 0.894 0.00000 0.00112 0.44 1
12 56360553 56366568 6016
CDK2 0.66 0.97 0 -Inf 0.881 0.00000 0.00111 0.45 1
12 54762917 54785082 22166
ZNF385A 0.73 0.97 0 -Inf 0.856 0.00000 0.00125 0.46 1
6 36644305 36655116 10812
CDKN1A 0.93 0.99 0 -Inf 0.456 0.00000 0.00118 0.65 1


This is a common challenge with pathway analysis of mutations, since many cancer pathways are usually driven by a single “celebrity” gene.

We can dig a little deeper to identify significant gene sets that do not have TP53. Here is one approach

## these are gene sets with TP53
has.tp53 = sapply(fish$sets, function(x) any(fish$hypotheses$gene_name[x] %in% 'TP53'))

## we subset on gene sets that do not have TP53
head(fish$setres[!has.tp53, ][order(p), ])
setname p fdr effect estimate p.left p.twosided
Extracellular matrix organization 2.48e-09 1e-07 1.38 [1.24-1.54] 1.38 1 0
GRB2 events in EGFR signaling 5.58e-09 2e-07 4.58 [2.72-7.72] 4.58 1 0
GRB2 events in ERBB2 signaling 8.09e-09 2e-07 3.67 [2.34-5.75] 3.67 1 0
EGFR Transactivation by Gastrin 1.74e-08 5e-07 4.14 [2.5-6.86] 4.14 1 0
SHC1 events in EGFR signaling 1.91e-08 5e-07 3.97 [2.43-6.49] 3.97 1 0
Constitutive Signaling by EGFRvIII 2.11e-08 5e-07 2.7 [1.89-3.86] 2.70 1 0


These sets appear interesting and are related to additional biological processes known to be important, such EGFR signaling. The top gene set “Extracellular matrix organization” appears interesting, because it is not obviously associated with known mutational driver pathways in lung adenocarcinoma. Let’s examine genes contributing to this top gene set:

## take the top non-TP53 associated gene set 
mysetname = fish$setres[!has.tp53, ][order(p)[1], setname]

## use sets list to pull the individual hypothesis results for members of this set
head(fish$res[fish$sets[[mysetname]], ][order(p), ])
seqnames start end width strand gene_name p fdr count effectsize count.pred count.density count.pred.density p.neg fdr.neg
2 189896622 190044605 147984
COL5A2 0.0041 0.85 20 1.68 6.23 0.00521 0.00162 1.00 1
16 30483979 30534506 50528
ITGAL 0.0045 0.85 17 1.69 5.27 0.00508 0.00158 1.00 1
4 155484108 155492238 8131
FGB 0.0048 0.85 9 2.04 2.19 0.00662 0.00161 1.00 1
16 55423612 55540603 116992
MMP2 0.0073 0.90 11 1.68 3.43 0.00627 0.00195 0.99 1
2 189839046 189877472 38427
COL3A1 0.0075 0.91 25 1.49 8.89 0.00696 0.00247 0.99 1
X 110488331 110513751 25421
CAPN6 0.0079 0.91 11 1.65 3.50 0.00580 0.00185 0.99 1

Interesting! This significant gene set is composed of extracellular matrix genes (COL5A2, ITGAL) that are not significant on their own. This constitutes a true pathway level hit.

Analyze truncating mutations

We can run a similar analysis but choosing only truncating mutations (by subsetting the mutation GRanges). Scoring this new model, we obtain a different mutation list, containing likely candidate drivers with enrichment of frameshift, nonsense, or nonstop indels and SNVs.

## replace events with new subset of mutations (using %Q% subsetting operator from gUtils)
fish$events = mutations %Q% 
   (grepl('(Frame_Shift_)|(Nonsense)|(OutOfFrame)|(Nonstop)', Variant_Classification))

## re-score model and inspect results
fish$score() 
fish$res[order(p), ][fdr<0.25, ]
fish$qqp()
seqnames start end width strand gene_name p fdr count effectsize count.pred count.density count.pred.density p.neg fdr.neg
17 7565097 7590856 25760
TP53 0.0e+00 0.00000 33 7.54 0.1770 0.02930 0.000157 1 1
X 47004268 47046212 41945
RBM10 0.0e+00 0.00014 6 6.06 0.0899 0.01140 0.000171 1 1
3 47057919 47205457 147539
SETD2 1.0e-07 0.00019 18 4.33 0.8920 0.00295 0.000146 1 1
16 3115298 3131908 16611
IL32 2.3e-05 0.05000 4 5.47 0.0905 0.00905 0.000205 1 1
8 126442563 126450647 8085
TRIB1 5.2e-05 0.08600 4 5.22 0.1070 0.00531 0.000142 1 1
1 27022524 27108595 86072
ARID1A 5.9e-05 0.08600 10 3.61 0.8170 0.00176 0.000144 1 1
7 855528 936072 80545
SUN1 1.7e-04 0.21000 5 4.01 0.3110 0.00224 0.000140 1 1

Though TP53, RBM10, SETD2, and ARID1A are well known targets of truncating mutations in lung adenocarcinoma, IL32, TRIB1, SUN1 are interesting candidates that have not previously been associated with lung adenocarcinoma.